Bank Marketing Data Analysis

1. Introduction

1-1. About Dataset

Our dataset relates to telemarketing campaigns run by a Portuguese banking institution. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit, our variable 'y') would be subscribed ('yes') or not ('no').

1-2. Problem/Task description

Main Problem Statement: Predict whether a bank client will subscribe to a term deposit (yes/no, binary classification) based on the client's basic information, telemarketing campaign data, and social & economic data.

Purpose of the project (Why do we need this project?):
While building this prediction model, we can also identify which types of clients are more likely to respond positively to the marketing campaign, i.e. the factors that make bank marketing more successful. There are many new marketing channels nowadays, but the one thing every company expects from a campaign is maximum revenue with minimum cost wasted on approaching non-responding clients. Direct marketing such as telemarketing is often costly and its response rate is quite low, so 'targeting the right client' is one of the most important things in this kind of marketing. Considering this, the main goal of this project is to build a prediction model that can suggest a proper direction, the right target clients, and a strategy for successful cross-selling of bank products such as term deposits.

Hypothesis & Research Questions:

  • Clients with a loan are less likely to subscribe to a term deposit than those without.
  • Clients with certain types of 'job' are more likely to subscribe to a term deposit than others.

1-3. Feature Explanation

Input variables:

Client Personal Info Variables:

1 - age (numeric)
2 - job : type of job (categorical:'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')

Last Contact Variables:

8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan' to 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon' to 'fri')
11 - duration: last contact duration, in seconds (numeric; not known before a call is performed, so it is dropped later in the outlier analysis)

Variables with other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

Social & Economic Variables:

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (Target):

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

Import Libraries

You can unfold the code if you want to check which libraries are used.

suppressPackageStartupMessages({
library(tidyverse)
library(rapportools)
library(data.table) # datafile reading
library(dplyr)
library(dlookr)
library(magrittr)
library(tidyr)
library(lubridate)
library(DT)
library(gmodels)
library(sqldf)
library(caret)      # rocr analysis
library(ROCR)       # rocr analysis
library(kableExtra) # nice table html formating 
library(gridExtra)  # arranging ggplot in grid
library(rpart)      # decision tree
library(rpart.plot) # decision tree plotting
library(caTools)    # split 
library(randomForest) #randomforests
library(class) # knn
library(ipred) # bagging
library(adabag) # boosting
library(purrr)
library(fastDummies)
library(DMwR)
  
#visualization
library(ggplot2)
library(ggmosaic)
library(corrplot)
library(ggthemes)
library(readr)
library(plotly)
library(cowplot)
library(plotROC)
library(ggpubr)
    
library(C50)
library(e1071)
library(pROC)
library(PRROC)
    
library(rattle)
library(RColorBrewer)
    
library(forecast)
library(zoo)
library(DescTools)
library(knitr)
library(VIM)
library(mice)
library(psych)
library(Epi)
})

Load Dataset

bank <- read.csv(file = "/Users/jiyoungkim/Desktop/SGH 2 sem/SLM class/Project/bank-additional/bank-additional-full.csv", 
                 sep =";", stringsAsFactors = T)
str(bank)
'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(bank)
      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
               education        default         housing           loan      
 university.degree  :12168   no     :32588   no     :18622   no     :33950  
 high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
      contact          month       day_of_week    duration     
 cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
 telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
                   aug    : 6178   thu:8623    Median : 180.0  
    campaign          pdays          previous            poutcome    
 Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
 1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
 Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
  emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
  nr.employed     y        
 Min.   :4964   no :36548  
 1st Qu.:5099   yes: 4640  
 Median :5191              
 [ reached getOption("max.print") -- omitted 4 rows ]

2. Data Preprocessing & Cleaning

2-1. Duplicate Rows

sum(duplicated(bank))
[1] 12

Let's drop all these duplicated rows

bank = bank %>% distinct()

2-2. Missing values

When checking for missing values, NA is not the only kind to look for; we actually need to check several possible cases:

  • NA values
  • Blank/Space values
  • Unknown values
sum(is.na(bank)) # turns out there are no NAs
[1] 0
sum(bank == "" | bank == " ") # turns out there are no blank/space values
[1] 0

Even though we don't have any NA or blank values, the table below shows that some variables still contain a large share of 'unknown' values.

colSums(bank=="unknown")
           age            job        marital      education        default 
             0            330             80           1730           8596 
       housing           loan        contact          month    day_of_week 
           990            990              0              0              0 
      duration       campaign          pdays       previous       poutcome 
             0              0              0              0              0 
  emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
             0              0              0              0              0 
             y 
             0 

That is far too many 'unknown' values to simply drop. Since they account for around 30% of all observations ((12718/41188)*100 = 30.88%), deleting all of them at once would be a huge sacrifice and might even distort the distribution of our target variable 'y'.
Therefore, I will decide how to deal with these missing values only after checking the distribution of 'y' within the 'unknown' level of each variable. I will then check the frequency distribution within each variable that has 'unknown' values to pick a sensible imputation value.
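As a minimal sketch of that check (the helper below is my addition, not part of the original pipeline; column names as in the dataset), we can compare the 'yes' ratio inside the 'unknown' rows of each affected variable with the overall rate:

```r
# Variables that contain 'unknown' values, per the colSums() table above
unknown_vars <- c("job", "marital", "education", "default", "housing", "loan")

# 'yes' ratio of the target within the 'unknown' rows of each variable
unknown_yes_ratio <- sapply(unknown_vars, function(v) {
  rows <- bank[[v]] == "unknown"
  mean(bank$y[rows] == "yes")
})

unknown_yes_ratio
mean(bank$y == "yes")  # overall 'yes' ratio, for comparison
```

If the 'yes' ratio inside a variable's 'unknown' level is close to the overall rate, dropping or merging that level is unlikely to distort the target distribution.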

2-3. Outliers

Diagnose outliers in the numerical variables (here, with_mean is the arithmetic mean of the variable including outliers).

diagnose_outlier(bank) %>% 
  mutate(outliers = outliers_mean / with_mean) %>% 
  arrange(desc(outliers))
        variables outliers_cnt outliers_ratio outliers_mean     with_mean
1        previous         5625      13.660870      1.266489    0.17301341
2        campaign         2406       5.843210     11.004156    2.56787935
3        duration         2963       7.195939    968.217010  258.31581504
4             age          468       1.136584     76.940171   40.02380027
5   cons.conf.idx          446       1.083155    -26.900000  -40.50286332
6           pdays         1515       3.679328      6.014521  962.46480960
7    emp.var.rate            0       0.000000           NaN    0.08192151
8  cons.price.idx            0       0.000000           NaN   93.57571989
9       euribor3m            0       0.000000           NaN    3.62129345
10    nr.employed            0       0.000000           NaN 5167.03486983
    without_mean    outliers
1     0.00000000 7.320177778
2     2.04433841 4.285308922
3   203.27074556 3.748190987
4    39.59939078 1.922360456
5   -40.65181684 0.664150576
6   999.00000000 0.006249082
7     0.08192151         NaN
8    93.57571989         NaN
9     3.62129345         NaN
10 5167.03486983         NaN

Based on 'outliers_ratio' and the outlier score (= outliers_mean/with_mean), 'previous', 'campaign', and 'duration' appear to be the three variables with the most severe outliers. Let's decide whether to delete or transform by looking at the distribution plots of these 3 variables.

bank %>%
  plot_outlier(diagnose_outlier(bank) %>% 
                 filter(outliers_ratio >= 5) %>% 
                 select(variables) %>% 
                 unlist())

Summary of Outliers Analysis
1) Variable 'duration': This variable represents the last contact duration in seconds (numeric). However, as warned in the dataset documentation, it is inappropriate to include as a predictor because the duration is not known before a call is performed. Therefore, we will simply delete this variable.
2) Variable 'campaign': This variable represents the number of contacts performed during this campaign for this client (including the last contact). From a marketing perspective, contacting one client more than 10 times is excessive and not very useful, so I decided to drop all rows with values > 10 in this variable.
3) Variable 'previous': Most of the values are concentrated at '0', meaning there were no contacts before this campaign for this client. Since only a small proportion of clients were contacted more than once before, I decided to bin this variable into a binary categorical variable, No (0) / Yes (1), indicating whether there was any previous contact. Accordingly, its data type will also change from numeric to a factor with 2 levels.

2-3.1. Transform variables based on Outliers analysis results

# Drop the variable 'duration'
bank_final = bank %>% 
  select(-duration)

# Drop all the values > 10 within the variable 'campaign'
bank_final = bank_final %>% 
  filter(campaign <= 10)

# Binning the variable 'previous' (num -> binary categorical variable)
bank_final = bank_final %>% 
  mutate(previous = if_else(previous >= 1, "1", "0"))
bank_final$previous <- as.factor(bank_final$previous)

3. EDA

Now, let's handle the remaining Data Preprocessing & Cleaning steps (such as dealing with missing data) alongside the EDA.

3-1. Univariate, Bivariate Analysis

Target Variable

We can see that our data is highly imbalanced, with Yes:No = 11%:89%, which means the proportion of 'Yes' is too small to use this data for modelling as-is. Therefore, after the EDA, when we prepare our dataset for modelling, we will use SMOTE (Synthetic Minority Over-sampling Technique) to deal with this imbalanced-data issue.

ggplot(data = bank_final, aes(x=y, label = scales::percent(prop.table(stat(count)))))+
  geom_bar(stat = "count", aes(fill=y))+
  geom_text(stat = 'count', position = position_dodge(.9), vjust = -0.5, size = 3.5) +
  scale_y_continuous(labels = scales::percent) +
  stat_count(geom = "text", colour = "black", size = 3.5, aes(label = ..count..),position=position_stack(vjust=0.5))+
  labs(title="Distribution of y variable (Deposit)", x="Deposit", y="Count") + scale_fill_brewer(palette="Set3")
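As flagged above, SMOTE will be applied at the modelling stage. A minimal sketch using DMwR::SMOTE (already loaded in the library block); the perc.over/perc.under values below are illustrative placeholders, not tuned choices:

```r
library(DMwR)  # provides SMOTE()

# perc.over = 200: create 2 synthetic minority ('yes') cases per original one;
# perc.under = 200: keep 2 majority ('no') cases per synthetic minority case.
# These values are placeholders for illustration, not tuned choices.
set.seed(123)
bank_smote <- SMOTE(y ~ ., data = bank_final, perc.over = 200, perc.under = 200)

prop.table(table(bank_smote$y))  # class balance after resampling
```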

Categorical Variables

As an overview, let's start by checking the level with the biggest proportion within each categorical variable.

diagnose_category(bank_final) %>%
  filter(rank == 1)
# A tibble: 12 x 6
   variables   levels                N  freq ratio  rank
   <chr>       <fct>             <int> <int> <dbl> <int>
 1 job         admin.            40307 10172  25.2     1
 2 marital     married           40307 24386  60.5     1
 3 education   university.degree 40307 11904  29.5     1
 4 default     no                40307 31943  79.2     1
 5 housing     yes               40307 21149  52.5     1
 6 loan        no                40307 33217  82.4     1
 7 contact     cellular          40307 25732  63.8     1
 8 month       may               40307 13592  33.7     1
 9 day_of_week thu               40307  8389  20.8     1
10 previous    0                 40307 34691  86.1     1
11 poutcome    nonexistent       40307 34691  86.1     1
12 y           no                40307 35695  88.6     1

Based on this result, we can glimpse the attributes that dominate our client pool. For example, a large share of clients have admin jobs, are married, hold a university degree, have no credit in default, have a housing loan, have no personal loan, were contacted by cellular phone, were last contacted in May on a Thursday, and had no previous contact before this campaign.

1) Job

job_y <- sqldf(
"select job, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select job, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

job_y
             job total deposit_yes deposit_no yes_ratio no_ratio
1        student   868         275        593     31.68    68.32
2        retired  1684         432       1252     25.65    74.35
3     unemployed   990         144        846     14.55    85.45
4         admin. 10172        1342       8830     13.19    86.81
5     management  2868         328       2540     11.44    88.56
6        unknown   321          36        285     11.21    88.79
7     technician  6611         728       5883     11.01    88.99
8  self-employed  1389         148       1241     10.66    89.34
9      housemaid  1032         106        926     10.27    89.73
10  entrepreneur  1424         124       1300      8.71    91.29
11      services  3886         322       3564      8.29    91.71
12   blue-collar  9062         627       8435      6.92    93.08

In the table above, the 'unknown' level (which is essentially a missing value) shows nothing unique or insightful: its distribution is similar to the other levels. Dropping these unknown values from the 'job' variable will therefore not hurt our dataset much. Let's drop them.

bank_final = bank_final %>% filter(job != "unknown")
bank_final$job <- factor(bank_final$job)

The plot below shows the overall distribution of the variable 'job' after dropping those 330 unknown values (visualized plot and data table).

# Set up default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
                                                hjust = 1,
                                                vjust = 0.5),
                     axis.text.y = element_blank(),
                     axis.ticks.y = element_blank())

job_vis <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, job), fill = y)) + mosaic_theme +
  xlab("job") + scale_fill_brewer(palette="Set2")

ggplotly(job_vis)
job_y_new <- sqldf(
"select job, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select job, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

job_y_new
             job total deposit_yes deposit_no yes_ratio no_ratio
1        student   868         275        593     31.68    68.32
2        retired  1684         432       1252     25.65    74.35
3     unemployed   990         144        846     14.55    85.45
4         admin. 10172        1342       8830     13.19    86.81
5     management  2868         328       2540     11.44    88.56
6     technician  6611         728       5883     11.01    88.99
7  self-employed  1389         148       1241     10.66    89.34
8      housemaid  1032         106        926     10.27    89.73
9   entrepreneur  1424         124       1300      8.71    91.29
10      services  3886         322       3564      8.29    91.71
11   blue-collar  9062         627       8435      6.92    93.08

Insights from the 'Job' variable
- As we can see above, clients who are students (31.7%) or retired (25.7%) subscribed to the deposit ('yes') relatively more often than clients with other jobs.
- Since retired clients show a high positive subscription ratio, it might also be a good idea to create another cross-selling product (a loan, fund, etc.) customized to retired clients.
- One of the hypotheses we set at the beginning, "Clients with certain types of 'job' are more likely to subscribe to a term deposit than others.", turns out to be true: clients who are students, retired, or unemployed are more likely to subscribe to the deposit.

2) Marital

marital_y <- sqldf(
"select marital, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select marital, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

marital_y
   marital total deposit_yes deposit_no yes_ratio no_ratio
1   single 11254        1596       9658     14.18    85.82
2  unknown    68           9         59     13.24    86.76
3 divorced  4506         471       4035     10.45    89.55
4  married 24158        2500      21658     10.35    89.65

In the 'marital' variable as well, the 'unknown' level gives no special insight and is a really small portion (68 obs.), so deleting it won't harm our data distribution.

marital_v <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, marital), fill = y)) + mosaic_theme +
  xlab("marital status") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")

ggplotly(marital_v)
marital_y_new <- sqldf(
"select marital, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select marital, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

marital_y_new
   marital total deposit_yes deposit_no yes_ratio no_ratio
1   single 11254        1596       9658     14.18    85.82
2 divorced  4506         471       4035     10.45    89.55
3  married 24158        2500      21658     10.35    89.65

Insights from the 'Marital' variable
As you can see, there is no severe difference in deposit subscription ratio between single (14.18%), divorced (10.45%), and married (10.35%) clients. Single clients subscribe only slightly more often.

3) Education

education_y <- sqldf(
"select education, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select education, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

education_y
            education total deposit_yes deposit_no yes_ratio no_ratio
1          illiterate    18           4         14     22.22    77.78
2             unknown  1556         231       1325     14.85    85.15
3   university.degree 11832        1645      10187     13.90    86.10
4 professional.course  5118         591       4527     11.55    88.45
5         high.school  9249        1024       8225     11.07    88.93
6            basic.4y  4028         420       3608     10.43    89.57
7            basic.6y  2225         187       2038      8.40    91.60
8            basic.9y  5892         465       5427      7.89    92.11

There are two things we need to preprocess based on the table above.
1) The 'illiterate' level, with only 18 observations, risks unduly influencing our modelling, so I will drop it.
2) Unlike the previous categorical variables, 'education' has quite a large number of 'unknown' values (1556), and the deposit ('yes') ratio within them is also quite high compared to the other levels. Therefore, we will merge this 'unknown' level into 'university.degree', which has the next-highest 'yes' ratio right below 'unknown' and a very similar distribution of y.

# Drop 'illiterate' level
bank_final = bank_final %>% filter(education != "illiterate")
# Integrate 'unknown' level into 'university.degree' level
bank_final = bank_final %>% 
  dplyr::mutate(education = recode(education, "unknown" = "university.degree"))

bank_final$education <- factor(bank_final$education)

After these two preprocessing steps, the distribution looks as below.

education_y_new <- sqldf(
"select education, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select education, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

education_y_new
            education total deposit_yes deposit_no yes_ratio no_ratio
1   university.degree 13388        1876      11512     14.01    85.99
2 professional.course  5118         591       4527     11.55    88.45
3         high.school  9249        1024       8225     11.07    88.93
4            basic.4y  4028         420       3608     10.43    89.57
5            basic.6y  2225         187       2038      8.40    91.60
6            basic.9y  5892         465       5427      7.89    92.11
education_vis<- ggplot(education_y_new, aes(education, yes_ratio)) +
  geom_point(aes(color = education, size = total)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) +
  labs(title = "Deposit(yes) ratio by Education",
        x = "Education", y = "Deposit(yes) ratio(%)") + theme(text = element_text(face = "bold")) +
  scale_colour_brewer()

ggplotly(education_vis)

Insights from the 'education' variable
Clients with a university degree have the highest deposit subscription ratio, followed by professional.course and high.school.

4) Default

# setting default parameters for crosstables
crosstable_function = function(df, var1, var2){
  # df: dataframe containing both columns to cross
  # var1, var2: columns to cross together.
  CrossTable(df[, var1], df[, var2],
             prop.r = T,
             prop.c = F,
             prop.t = F,
             prop.chisq = F,
             dnn = c(var1, var2))
}


crosstable_function(bank_final, "default", "y")

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  39900 

 
             | y 
     default |        no |       yes | Row Total | 
-------------|-----------|-----------|-----------|
          no |     27564 |      4134 |     31698 | 
             |     0.870 |     0.130 |     0.794 | 
-------------|-----------|-----------|-----------|
     unknown |      7770 |       429 |      8199 | 
             |     0.948 |     0.052 |     0.205 | 
-------------|-----------|-----------|-----------|
         yes |         3 |         0 |         3 | 
             |     1.000 |     0.000 |     0.000 | 
-------------|-----------|-----------|-----------|
Column Total |     35337 |      4563 |     39900 | 
-------------|-----------|-----------|-----------|

 

For the 'default' (has credit in default?) variable, only 3 clients fall into the 'yes' level, which makes it practically unusable for modelling even with oversampling. So, I decided to drop this variable.

# Dropping the variable 'default'
bank_final = bank_final %>% 
  select(-default)

5) Housing

housing_y <- sqldf(
"select housing, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select housing, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

housing_y
  housing total deposit_yes deposit_no yes_ratio no_ratio
1     yes 20944        2468      18476     11.78    88.22
2 unknown   964         107        857     11.10    88.90
3      no 17992        1988      16004     11.05    88.95

We will just drop the 'unknown' level of the 'housing' variable since it doesn't give much special information.

# Dropping 'unknown' level from the variable 'housing'
bank_final = bank_final %>% filter(housing != "unknown")
bank_final$housing <- factor(bank_final$housing)
housing_v <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, housing), fill = y)) + mosaic_theme +
  xlab("housing loan") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")

ggplotly(housing_v)

The housing loan variable doesn't show a big difference in deposit subscription ratio; still, clients with a housing loan subscribe slightly more often.

6) Loan

bank_final$loan <- factor(bank_final$loan)
loan_y <- sqldf(
"select loan, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select loan, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

loan_y
  loan total deposit_yes deposit_no yes_ratio no_ratio
1   no 32882        3781      29101     11.50    88.50
2  yes  6054         675       5379     11.15    88.85
loan_v <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, loan), fill = y)) + mosaic_theme +
  xlab("Loan") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")

ggplotly(loan_v)

We have more clients without a loan than with one. However, the deposit ('yes') ratio barely differs between loan = yes and loan = no. Therefore, the very first hypothesis we set at the beginning of this project, "Clients with a loan are less likely to subscribe to a term deposit than those without.", turns out to hold only by a very slight margin. So, let me conclude that loan status and deposit subscription do not show much of a relationship here.
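As a quick supplementary check (my addition, not part of the original pipeline), a chi-squared test of independence between 'loan' and 'y' can tell us whether even this slight difference is statistically meaningful:

```r
# Chi-squared test of independence between loan status and subscription.
# A large p-value supports the conclusion that loan status and deposit
# subscription are essentially unrelated in this data.
chisq.test(table(bank_final$loan, bank_final$y))
```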

8) Month

month_y <- sqldf(
"select month, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select month, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

month_y
   month total deposit_yes deposit_no yes_ratio no_ratio
1    mar   521         268        253     51.44    48.56
2    dec   174          85         89     48.85    51.15
3    sep   545         244        301     44.77    55.23
4    oct   686         302        384     44.02    55.98
5    apr  2549         519       2030     20.36    79.64
6    jun  4851         534       4317     11.01    88.99
7    aug  5829         622       5207     10.67    89.33
8    nov  4006         405       3601     10.11    89.89
9    jul  6672         618       6054      9.26    90.74
10   may 13103         859      12244      6.56    93.44
bank_final$month <- factor(bank_final$month, levels=c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))

month_vis <- ggplot(bank_final, aes(month, fill=y)) + geom_bar(position="stack") + scale_fill_brewer(palette="Set2") + ylab("Deposit(yes/no)") + ggtitle("Distribution of Client and Deposit by Last contact Month")

ggplotly(month_vis)
month_y$month <- factor(month_y$month, levels = month_y$month[order(-month_y$yes_ratio)])

month_vis2 <- ggplot(month_y, aes(x=month, y=yes_ratio)) +
  geom_bar(width=0.5, stat = "identity", fill="#fb8d61") +
  labs(title="Deposit(yes) ratio by Last Contact Month") +
  theme(axis.text.x = element_text(angle = 90))

ggplotly(month_vis2)

As the first plot shows, the months in which clients were most often last contacted are May, July, August, and June (there were no contacts at all in January and February). However, the months with a high deposit ('yes') ratio are, ironically, March, December, September, and October. It is actually very interesting: calling a lot of customers doesn't guarantee a high chance of getting deposit subscriptions; rather, the months in which the bank contacted far fewer customers turn out to have a high return on deposit subscription. It's important to keep this insight in mind and check whether the month variable has a strong influence on our target variable in the model.

9) Day of Week

weekday_y <- sqldf(
"select day_of_week, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select day_of_week, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

weekday_y
  day_of_week total deposit_yes deposit_no yes_ratio no_ratio
1         thu  8106         997       7109     12.30    87.70
2         tue  7645         909       6736     11.89    88.11
3         wed  7714         917       6797     11.89    88.11
4         fri  7393         815       6578     11.02    88.98
5         mon  8078         818       7260     10.13    89.87
bank_final$day_of_week <- factor(bank_final$day_of_week, levels=c("mon", "tue", "wed", "thu", "fri"))

weekday_vis <- ggplot(bank_final, aes(day_of_week, fill=y)) + geom_bar(position="stack") + ylab("Deposit(yes/no)") + ggtitle("Distribution of Client and Deposit by Day of Week")

ggplotly(weekday_vis)
weekday_y$day_of_week <- factor(weekday_y$day_of_week, levels = weekday_y$day_of_week[order(-weekday_y$yes_ratio)])

weekday_vis2 <- ggplot(weekday_y, aes(x=day_of_week, y=yes_ratio)) +
  geom_bar(width=0.5, stat = "identity", fill="#2ebfc4") +
  labs(title="Deposit(yes) ratio by Day of Week") +
  theme(axis.text.x = element_text(angle = 90))

ggplotly(weekday_vis2)

The result shows that certain weekdays, such as Thursday, Tuesday, and Wednesday, receive more positive responses to the deposit subscription. Overall, however, the gap in deposit(yes/no) ratios between weekdays is not very prominent.

10) Poutcome

poutcome_y <- sqldf(
"select poutcome, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select poutcome, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

poutcome_y
     poutcome total deposit_yes deposit_no yes_ratio no_ratio
1     success  1320         863        457     65.38    34.62
2     failure  4101         572       3529     13.95    86.05
3 nonexistent 33515        3021      30494      9.01    90.99
poutcome_vis <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, poutcome), fill = y)) + mosaic_theme +
  xlab("previous campaign outcome") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")

ggplotly(poutcome_vis)

Based on the above results, we can conclude that clients who were contacted at least once in a previous campaign are more likely to respond positively and subscribe to a deposit. Even when the previous outcome was a failure, these clients still show a higher deposit(yes) ratio than clients who have never been contacted for marketing.

Numerical Variables

For the numerical variables, we will not examine every one of them, but rather focus on the variables that carry important insights.

1) Age

age_y <- sqldf(
"select age, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select age, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no 
from bank_final
group by 1
) order by yes_ratio DESC")

age_y
   age total deposit_yes deposit_no yes_ratio no_ratio
1   87     1           1          0    100.00     0.00
2   89     2           2          0    100.00     0.00
3   98     2           2          0    100.00     0.00
4   92     4           3          1     75.00    25.00
5   86     7           5          2     71.43    28.57
6   77    19          12          7     63.16    36.84
7   82    16          10          6     62.50    37.50
8   80    30          17         13     56.67    43.33
9   79    13           7          6     53.85    46.15
10  76    32          17         15     53.13    46.88
11  78    25          13         12     52.00    48.00
12  66    49          25         24     51.02    48.98
 [ reached 'max' / getOption("max.print") -- omitted 66 rows ]
boxplot(bank_final$age, main="Age plot", yaxt="n", xlab="age", horizontal=TRUE)

ggplot(bank_final, aes(x=age, fill=y)) + geom_histogram(binwidth=1) + # use bare column names inside aes(), not bank_final$
  labs(y= "Count", x="Age", title = "Age")

Based on the plots above, we can think of two marketing campaign strategies: 1) target young-to-middle-aged people, roughly in their 30s-40s; 2) develop strategies aimed at older clients or very young clients in their 20s. Targeting clients in their 20s looks especially promising, because the deposit distribution shows that subscriptions start to increase from around age 25. Retaining young clients also benefits the bank, since they can become long-term loyal clients once rapport is built.

Next, I will transform this Age variable (numeric) into a categorical variable by grouping the age range into 'low/mid/high'.

bank_final = bank_final %>% 
  mutate(age = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))
bank_final$age <- factor(bank_final$age)

Histogram plot of all numeric variables in overall

bank_final %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

Based on this histogram plot, we should examine the variables campaign and pdays in detail, since pdays shows an outlier-like distribution.

2) Campaign

campaign_vis <- bank_final %>% 
  ggplot() +
  geom_mosaic(aes(x = product(y, campaign), fill = y)) + mosaic_theme +
  xlab("number of contacts during this campaign") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")

ggplotly(campaign_vis)

We can see that contacting a client too many times during the campaign (more than about 5 times) actually decreases deposit subscription. So it seems adequate to keep campaign contacts to at most 5 per client.
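To back this claim up numerically, a quick summary of the deposit(yes) ratio by contact count (grouping 5 or more contacts together) can be sketched as follows, assuming dplyr is loaded and y is still coded 'yes'/'no' at this point:

```r
# Deposit(yes) ratio by number of contacts in the current campaign, capped at 5+
bank_final %>%
  mutate(campaign_grp = ifelse(campaign >= 5, "5+", as.character(campaign))) %>%
  group_by(campaign_grp) %>%
  summarise(total = n(),
            yes_ratio = round(100 * mean(y == "yes"), 2)) %>%
  arrange(campaign_grp)
```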

3) Pdays

ggplot(data = bank_final, aes(x=pdays)) + # use the bare column name inside aes()
  geom_histogram(bins=30) +
  labs(x="Number of days since last contact", y="Count")

The pdays distribution is hard to use for our analysis, since most values are 999, meaning the client was not previously contacted. This may partly result from dropping some variables and 'unknown' levels in the earlier EDA steps. In any case, we already have a very similar variable, 'previous', which indicates whether the client was contacted before this campaign, so we can simply drop pdays.

bank_final = bank_final %>% 
  select(-pdays)

3-2. Correlation

Correlation plot of numeric variables.

bank_num <-select_if(bank_final, is.numeric) # only selecting numeric variables

bank_num %>%
  cor() %>%
  corrplot(method = "number",
           type = "lower", 
           tl.cex = 0.8,
           tl.srt = 45)

We can see high correlations between euribor3m and emp.var.rate, cons.price.idx and emp.var.rate, emp.var.rate and nr.employed, and euribor3m and nr.employed. Most of the social and economic attribute variables are highly correlated with each other, which is to be expected. Therefore, I will drop emp.var.rate and euribor3m, each of which is highly correlated with 2-3 other variables at the same time.
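As an alternative to picking variables by eye, caret::findCorrelation() suggests columns to drop given a correlation matrix and a cutoff; a sketch, assuming the caret package is installed (the 0.8 cutoff is my choice, not taken from the analysis above):

```r
library(caret)

cor_mat <- cor(select_if(bank_final, is.numeric))
to_drop <- findCorrelation(cor_mat, cutoff = 0.8, names = TRUE)
to_drop  # candidate columns to remove; likely includes emp.var.rate and euribor3m
```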

bank_final = bank_final %>% 
  select(-emp.var.rate)
bank_final = bank_final %>% 
  select(-euribor3m)

Let's check the correlation again.

bank_num <-select_if(bank_final, is.numeric) # only selecting numeric variables

bank_num %>%
  cor() %>%
  corrplot(method = "number",
           type = "lower", 
           tl.cex = 0.8,
           tl.srt = 45)

Now the remaining numeric variables no longer show problematic correlations.

4. Data Preparation for Modelling (final pre-processing)

Categorical variable to 1/0 level, Dummy variable

From now on, we will do the last pre-processing step to make our data ready for modelling. To use several machine learning models, all categorical variables must be expressed numerically. To do so, we will recode variables with two levels (yes/no) as 1/0 and expand variables with more than two levels into dummy variables.
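As a side note, the nested ifelse() chains used below for month and day_of_week can be written more compactly with match(), which returns each value's position in a reference vector. A minimal, self-contained sketch (it does not touch bank_final):

```r
# match() maps month abbreviations directly to 1-12 via their position
month_levels <- c("jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec")
match(c("may", "mar", "dec"), month_levels)  # 5 3 12
```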

# month
bank_final$month <- ifelse(bank_final$month == 'jan', 1,
                ifelse(bank_final$month == 'feb', 2,
                ifelse(bank_final$month == 'mar', 3,
                ifelse(bank_final$month == 'apr', 4,
                ifelse(bank_final$month == 'may', 5,
                ifelse(bank_final$month == 'jun', 6,
                ifelse(bank_final$month == 'jul', 7,
                ifelse(bank_final$month == 'aug', 8,
                ifelse(bank_final$month == 'sep', 9,
                ifelse(bank_final$month == 'oct', 10,
                ifelse(bank_final$month == 'nov', 11, 12)))))))))))

# day_of_week
bank_final$day_of_week <- ifelse(bank_final$day_of_week == 'mon', 1,
                ifelse(bank_final$day_of_week == 'tue', 2,
                ifelse(bank_final$day_of_week == 'wed', 3,
                ifelse(bank_final$day_of_week == 'thu', 4, 5))))
# variables with 2 levels into 1/0
bank_final$housing <- ifelse(bank_final$housing == 'no', 0, 1)
bank_final$loan <- ifelse(bank_final$loan == 'no', 0, 1)
bank_final$contact <- ifelse(bank_final$contact == 'cellular', 0, 1) #contact has only 2 levels with cellular and telephone
bank_final$previous <- ifelse(bank_final$previous == '0', 0, 1)
bank_final$y <- ifelse(bank_final$y == 'no', 0, 1)
# variables with more than 2 levels into dummy variables
bank_dummies <- dummy_cols(.data=bank_final, select_columns = c("age", "job", "marital", "education", "poutcome"), remove_first_dummy = FALSE)
bank_dummies
  age       job marital education housing loan contact month day_of_week
1 mid housemaid married  basic.4y       0    0       1     5           1
  campaign previous    poutcome cons.price.idx cons.conf.idx nr.employed y
1        1        0 nonexistent         93.994         -36.4        5191 0
  age_high age_low age_mid job_admin. job_blue-collar job_entrepreneur
1        0       0       1          0               0                0
  job_housemaid job_management job_retired job_self-employed job_services
1             1              0           0                 0            0
  job_student job_technician job_unemployed marital_divorced marital_married
1           0              0              0                0               1
  marital_single education_basic.4y education_basic.6y education_basic.9y
1              0                  1                  0                  0
  education_high.school education_professional.course
1                     0                             0
  education_university.degree poutcome_failure poutcome_nonexistent
1                           0                0                    1
  poutcome_success
1                0
 [ reached 'max' / getOption("max.print") -- omitted 38935 rows ]
bank_ready = bank_dummies[, -c(1,2,3,4,12)]
bank_ready <- bank_ready[c(11, 1:10, 12:37)]
str(bank_ready)
'data.frame':   38936 obs. of  37 variables:
 $ y                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ housing                      : num  0 0 1 0 0 0 0 0 1 1 ...
 $ loan                         : num  0 0 0 0 1 0 0 0 0 0 ...
 $ contact                      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ month                        : num  5 5 5 5 5 5 5 5 5 5 ...
 $ day_of_week                  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ campaign                     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ previous                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ cons.price.idx               : num  94 94 94 94 94 ...
 $ cons.conf.idx                : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ nr.employed                  : num  5191 5191 5191 5191 5191 ...
 $ age_high                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ age_low                      : int  0 0 0 0 0 0 0 0 1 1 ...
 $ age_mid                      : int  1 1 1 1 1 1 1 1 0 0 ...
 $ job_admin.                   : int  0 0 0 1 0 0 1 0 0 0 ...
 $ job_blue-collar              : int  0 0 0 0 0 0 0 1 0 0 ...
 $ job_entrepreneur             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_housemaid                : int  1 0 0 0 0 0 0 0 0 0 ...
 $ job_management               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_retired                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_self-employed            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_services                 : int  0 1 1 0 1 1 0 0 0 1 ...
 $ job_student                  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_technician               : int  0 0 0 0 0 0 0 0 1 0 ...
 $ job_unemployed               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ marital_divorced             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ marital_married              : int  1 1 1 1 1 1 1 1 0 0 ...
 $ marital_single               : int  0 0 0 0 0 0 0 0 1 1 ...
 $ education_basic.4y           : int  1 0 0 0 0 0 0 0 0 0 ...
 $ education_basic.6y           : int  0 0 0 1 0 0 0 0 0 0 ...
 $ education_basic.9y           : int  0 0 0 0 0 1 0 0 0 0 ...
 $ education_high.school        : int  0 1 1 0 1 0 0 0 0 1 ...
 $ education_professional.course: int  0 0 0 0 0 0 1 0 1 0 ...
 $ education_university.degree  : int  0 0 0 0 0 0 0 1 0 0 ...
 $ poutcome_failure             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome_nonexistent         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ poutcome_success             : int  0 0 0 0 0 0 0 0 0 0 ...

Finally, our dataset is ready with numeric variables.

write.csv(bank_ready, "/Users/jiyoungkim/Desktop/SGH 2 sem/SLM class/Project/bank_final.csv")

Data Splitting (Train:Test=70:30)

require(caTools)
set.seed(1)
# Train and Test
sample = sample.split(bank_ready$y, SplitRatio = 0.7) # split on the label vector so class proportions are preserved
bank_train = subset(bank_ready, sample==TRUE)
bank_test = subset(bank_ready, sample==FALSE)

SMOTE (method to deal with imbalanced data)

We will oversample the minority class (y=1) using the SMOTE method.
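For reference, under DMwR-style SMOTE semantics, perc.over = 100 creates one synthetic minority case per original minority case (doubling the minority class), and perc.under = 200 then keeps 200% of the number of newly created minority cases from the majority class. A back-of-envelope sketch of the resulting balance, with a hypothetical minority count n1:

```r
# Class sizes after SMOTE(perc.over = 100, perc.under = 200), given n1 minority cases
n1 <- 3004                                # hypothetical minority count in the train set
synthetic      <- n1 * (100 / 100)        # perc.over = 100 -> one synthetic case per original
minority_after <- n1 + synthetic          # 2 * n1
majority_after <- synthetic * (200 / 100) # perc.under = 200 -> 2 * n1 majority cases kept
prop.table(c("0" = majority_after, "1" = minority_after))  # 0.5 0.5
```

This is why the class distribution comes out exactly 50:50 after SMOTE.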

prop.table(table(bank_train$y)) # distribution of deposit y 0/1 in our train set

        0         1 
0.8858185 0.1141815 
bank_train_smote <- bank_train
bank_train_smote$y = as.factor(bank_train_smote$y)
bank_train_smote <- SMOTE(form = y~., data = bank_train_smote, perc.over = 100, perc.under =200)

prop.table(table(bank_train_smote$y)) # check distribution of deposit y 0/1 in our train set after SMOTE (now, it's 50:50)

  0   1 
0.5 0.5 
bank_test$y = as.factor(bank_test$y)

5. Modelling

# Modelling Pipeline we need for visualization of Model Evaluation
# plotting importance from predictive models into two panels
fun_imp_ggplot_split = function(model){
  # model: model used to plot variable importances
  
  if (class(model)[1] == "ranger"){
    imp_df = model$variable.importance %>% 
      data.frame("Overall" = .) %>% 
      rownames_to_column() %>% 
      rename(variable = rowname) %>% 
      arrange(-Overall)
  } else {
    imp_df = varImp(model) %>%
      rownames_to_column() %>% 
      rename(variable = rowname) %>% 
      arrange(-Overall)
  }
  
  # first panel (half most important variables)
  gg1 = imp_df %>% 
    slice(1:floor(nrow(.)/2)) %>% 
    ggplot() +
    aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
    geom_bar() +
    coord_flip() +
    xlab("Variables") +
    ylab("Importance") +
    theme(legend.position = "none")
  
  imp_range = ggplot_build(gg1)[["layout"]][["panel_params"]][[1]][["x.range"]]
  imp_gradient = scale_fill_gradient(limits = c(-imp_range[2], -imp_range[1]),
                                     low = "#132B43", 
                                     high = "#56B1F7")
  
  # second panel (less important variables)
  gg2 = imp_df %>% 
    slice((floor(nrow(.)/2) + 1):nrow(.)) %>%  # parentheses needed: ':' binds tighter than '+'
    ggplot() +
    aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
    geom_bar() +
    coord_flip() +
    xlab("") +
    ylab("Importance") +
    theme(legend.position = "none") +
    ylim(imp_range) +
    imp_gradient
  
  # arranging together
  gg_both = plot_grid(gg1 + imp_gradient,
                      gg2)
  
  return(gg_both)
}

# plotting two performance measures
fun_gg_cutoff = function(score, obs, measure1, measure2) {
  # score: predicted scores
  # obs: real classes
  # measure1, measure2: which performance metrics to plot
  
  predictions = prediction(score, obs)
  performance1 = performance(predictions, measure1)
  performance2 = performance(predictions, measure2)
  
  df1 = data.frame(x = performance1@x.values[[1]],
                   y = performance1@y.values[[1]],
                   measure = measure1,
                   stringsAsFactors = F) %>% 
    drop_na()
  df2 = data.frame(x = performance2@x.values[[1]],
                   y = performance2@y.values[[1]],
                   measure = measure2,
                   stringsAsFactors = F) %>% 
    drop_na()
  
  # df contains all the data needed to plot both curves
  df = df1 %>% 
    bind_rows(df2)
  
  # extracting best cut for each measure
  y_max_measure1 = max(df1$y, na.rm = T)
  x_max_measure1 = df1[df1$y == y_max_measure1, "x"][1]
  
  y_max_measure2 = max(df2$y, na.rm = T)
  x_max_measure2 = df2[df2$y == y_max_measure2, "x"][1]
  
  txt_measure1 = paste("Best cut for", measure1, ": x =", round(x_max_measure1, 3))
  txt_measure2 = paste("Best cut for", measure2, ": x =", round(x_max_measure2, 3))
  txt_tot = paste(txt_measure1, "\n", txt_measure2, sep = "")
  
  # plotting both measures in the same plot, with some detail around.
  gg = df %>% 
    ggplot() +
    aes(x = x,
        y = y,
        colour = measure) +
    geom_line() +
    geom_vline(xintercept = c(x_max_measure1, x_max_measure2), linetype = "dashed", color = "gray") +
    geom_hline(yintercept = c(y_max_measure1, y_max_measure2), linetype = "dashed", color = "gray") +
    labs(caption = txt_tot) +
    theme(plot.caption = element_text(hjust = 0)) +
    xlim(c(0, 1)) +
    ylab("") +
    xlab("Threshold")
  
  return(gg)
}

# creating classes according to score and cut
fun_cut_predict = function(score, cut) {
  # score: predicted scores
  # cut: threshold for classification
  
  classes = score
  classes[classes > cut] = 1
  classes[classes <= cut] = 0
  classes = as.factor(classes)
  
  return(classes)  
}

# computing AUPR
aucpr = function(obs, score){
  # obs: real classes
  # score: predicted scores
  
  df = data.frame("pred" = score,
                  "obs" = obs)
  
  prc = pr.curve(df[df$obs == 1, ]$pred,
                 df[df$obs == 0, ]$pred)
  
  return(prc$auc.davis.goadrich)
}

# plotting PR curve
gg_prcurve = function(df) {
  # df: df containing models scores by columns and the last column must be
  #     nammed "obs" and must contain real classes.
  
  # init
  df_gg = data.frame("v1" = numeric(), 
                     "v2" = numeric(), 
                     "v3" = numeric(), 
                     "model" = character(),
                     stringsAsFactors = F)
  
  # individual pr curves
  for (i in c(1:(ncol(df)-1))) {
    x1 = df[df$obs == 1, i]
    x2 = df[df$obs == 0, i]
    prc = pr.curve(x1, x2, curve = T)
    
    df_prc = as.data.frame(prc$curve, stringsAsFactors = F) %>% 
      mutate(model = colnames(df)[i])
    
    # combining pr curves
    df_gg = bind_rows(df_gg,
                      df_prc)
    
  }
  
  gg = df_gg %>% 
    ggplot() +
    aes(x = V1, y = V2, colour = model) +
    geom_line() +
    xlab("Recall") +
    ylab("Precision")
  
  return(gg)
}

5-1) Logistic Regression

Let's start building logistic regression model first.

logistic <- glm(y~., data=bank_train_smote, family="binomial")
summary(logistic)

Call:
glm(formula = y ~ ., family = "binomial", data = bank_train_smote)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8234  -0.8749  -0.1629   0.9241   2.0497  

Coefficients: (6 not defined because of singularities)
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   35.3904625  4.3147153   8.202 2.36e-16 ***
housing                       -0.0584067  0.0432331  -1.351 0.176704    
loan                          -0.1612413  0.0607822  -2.653 0.007983 ** 
contact                       -0.7896574  0.0661515 -11.937  < 2e-16 ***
month                          0.0002514  0.0127347   0.020 0.984251    
day_of_week                    0.0263697  0.0155738   1.693 0.090415 .  
campaign                      -0.0444649  0.0138909  -3.201 0.001370 ** 
previous                       1.2204701  0.1292557   9.442  < 2e-16 ***
cons.price.idx                 0.2475383  0.0532245   4.651 3.31e-06 ***
cons.conf.idx                  0.0125578  0.0052807   2.378 0.017403 *  
nr.employed                   -0.0111821  0.0004108 -27.218  < 2e-16 ***
age_high                       0.5511016  0.1645966   3.348 0.000813 ***
age_low                        0.1358729  0.0599678   2.266 0.023466 *  
age_mid                               NA         NA      NA       NA    
job_admin.                    -0.2050983  0.1397604  -1.467 0.142240    
 [ reached getOption("max.print") -- omitted 22 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16658  on 12015  degrees of freedom
Residual deviance: 13225  on 11985  degrees of freedom
AIC: 13287

Number of Fisher Scoring iterations: 5

We see some variables with high p-values (>0.05), meaning they are not significant in our model. Therefore, we will build a model with only the significant variables in the next step.

Choose appropriate Cut/Threshold

Before getting into model evaluation, let's consider which metrics are appropriate for this telemarketing business problem. The top priority is to precisely find clients who are willing to subscribe to a deposit product (TP, true positives). At the same time, we should not miss potential clients who would actually subscribe but are classified as if they will not (FN, false negatives). Considering this, the F-score is an ideal metric here: a high F-score requires both high precision and high recall. Therefore, I will focus on the F1 score, which combines precision and recall, rather than on accuracy alone.
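Since the F1 score drives our threshold choice, here is a minimal sketch of how it is computed from confusion-matrix counts (the helper name f1_score is mine, not part of the pipeline above):

```r
# F1 is the harmonic mean of precision and recall
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# With TP = 697, FP = 811, FN = 755 (the logistic model's test confusion matrix):
round(f1_score(697, 811, 755), 4)  # 0.4709, matching caret's reported F1
```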

logistic_train_score = predict(logistic, newdata = bank_train_smote, type = "response")
logistic_test_score = predict(logistic, newdata = bank_test, type = "response")
measure_train = fun_gg_cutoff(logistic_train_score, bank_train_smote$y, 
                              "acc", "f")
measure_train +
  geom_vline(xintercept = c(0.4, 0.5), 
             linetype = "dashed")

We can see that the best cut (threshold) for both F-score and accuracy in the training-set prediction is 0.4.

Then, let's check real test set threshold.

logistic_train_cut = 0.4
measure_test = fun_gg_cutoff(logistic_test_score, bank_test$y, "acc", "f")
measure_test + geom_vline(xintercept = c(0.5, logistic_train_cut), linetype = "dashed")

Considering the trade-off between accuracy and F-score, it is better to choose 0.68 as the threshold for test-set prediction, because it gives the highest F1 score while keeping accuracy high.

Confusion Matrix & Model Evaluation Metrics

logistic_test_class = fun_cut_predict(logistic_test_score, 0.68)
# matrix
logistic_test_confm = confusionMatrix(logistic_test_class, bank_test$y, 
                                      positive = "1",
                                      mode = "everything")
logistic_test_confm
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 10364   755
         1   811   697
                                          
               Accuracy : 0.876           
                 95% CI : (0.8701, 0.8817)
    No Information Rate : 0.885           
    P-Value [Acc > NIR] : 0.9992          
                                          
                  Kappa : 0.4007          
                                          
 Mcnemar's Test P-Value : 0.1646          
                                          
            Sensitivity : 0.4800          
            Specificity : 0.9274          
         Pos Pred Value : 0.4622          
         Neg Pred Value : 0.9321          
              Precision : 0.4622          
                 Recall : 0.4800          
                     F1 : 0.4709          
             Prevalence : 0.1150          
         Detection Rate : 0.0552          
   Detection Prevalence : 0.1194          
      Balanced Accuracy : 0.7037          
                                          
       'Positive' Class : 1               
                                          

5-2) Logistic Regression with selected variables (Stepwise selection)

logistic2 <- step(logistic, direction = "both")
summary(logistic2)

Call:
glm(formula = y ~ loan + contact + day_of_week + campaign + previous + 
    cons.price.idx + cons.conf.idx + nr.employed + age_high + 
    age_low + job_admin. + job_entrepreneur + job_retired + job_services + 
    job_student + marital_divorced + education_basic.4y + education_basic.6y + 
    education_basic.9y + education_high.school + education_professional.course + 
    poutcome_failure, family = "binomial", data = bank_train_smote)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8062  -0.8767  -0.1690   0.9229   2.0473  

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   35.0189570  4.2976897   8.148 3.69e-16 ***
loan                          -0.1675055  0.0606105  -2.764  0.00572 ** 
contact                       -0.7883827  0.0616705 -12.784  < 2e-16 ***
day_of_week                    0.0268959  0.0155646   1.728  0.08399 .  
campaign                      -0.0437198  0.0138670  -3.153  0.00162 ** 
previous                       1.2216896  0.1289179   9.476  < 2e-16 ***
cons.price.idx                 0.2502793  0.0531471   4.709 2.49e-06 ***
cons.conf.idx                  0.0125415  0.0045088   2.782  0.00541 ** 
nr.employed                   -0.0111892  0.0003919 -28.553  < 2e-16 ***
age_high                       0.5355574  0.1644172   3.257  0.00112 ** 
age_low                        0.1513121  0.0568551   2.661  0.00778 ** 
job_admin.                    -0.1073273  0.0555672  -1.931  0.05342 .  
job_entrepreneur              -0.1765985  0.1198463  -1.474  0.14061    
job_retired                    0.3571561  0.1240757   2.879  0.00400 ** 
job_services                  -0.1466066  0.0858669  -1.707  0.08775 .  
 [ reached getOption("max.print") -- omitted 8 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16658  on 12015  degrees of freedom
Residual deviance: 13230  on 11993  degrees of freedom
AIC: 13276

Number of Fisher Scoring iterations: 5

As we can see above, stepwise variable selection leaves only significant variables in our LR model (except for education_basic.4y).

logistic2_train_score = predict(logistic2, newdata = bank_train_smote, type = "response")
logistic2_test_score = predict(logistic2, newdata = bank_test, type = "response")
measure_train2 = fun_gg_cutoff(logistic2_train_score, bank_train_smote$y, "acc", "f")
measure_train2 + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")

We can see that the best cut (threshold) for both F-score and accuracy in the training-set prediction is 0.4.

Then, let's check real test set threshold.

logistic2_train_cut = 0.4
measure_test2 = fun_gg_cutoff(logistic2_test_score, bank_test$y, "acc", "f")
measure_test2 + geom_vline(xintercept = c(0.5, logistic2_train_cut), linetype = "dashed")

For test-set prediction, the ideal threshold is 0.67, since it gives the highest F-score with moderate accuracy.

Confusion Matrix & Model Evaluation Metrics

logistic2_test_class = fun_cut_predict(logistic2_test_score, 0.67)
# matrix
logistic2_test_confm = confusionMatrix(logistic2_test_class, bank_test$y, 
                                      positive = "1",
                                      mode = "everything")
logistic2_test_confm
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 10305   733
         1   870   719
                                          
               Accuracy : 0.873           
                 95% CI : (0.8671, 0.8788)
    No Information Rate : 0.885           
    P-Value [Acc > NIR] : 0.9999847       
                                          
                  Kappa : 0.4009          
                                          
 Mcnemar's Test P-Value : 0.0006817       
                                          
            Sensitivity : 0.49518         
            Specificity : 0.92215         
         Pos Pred Value : 0.45249         
         Neg Pred Value : 0.93359         
              Precision : 0.45249         
                 Recall : 0.49518         
                     F1 : 0.47287         
             Prevalence : 0.11499         
         Detection Rate : 0.05694         
   Detection Prevalence : 0.12584         
      Balanced Accuracy : 0.70866         
                                          
       'Positive' Class : 1               
                                          

Compared to the first full LR model, most metrics did not change much; sensitivity, recall, and the F1 score increased slightly.

5-3) Decision Tree

There are four main parameters to tune in a decision tree: cp (complexity parameter), minsplit (minimum number of observations in a node before a split), xval (number of cross-validations), and maxdepth (maximum depth of any node). The smaller cp is, the more complex the tree becomes, so finding the optimal cp is important. We will therefore tune this cp parameter.

set.seed(1004)
dtree1 <- rpart(y~., data = bank_train_smote, cp = 0.1^20) # full tree model 
rpart.plot(dtree1)

# Find the optimal CP
dtree1$cptable
             CP nsplit rel error    xerror        xstd
1  4.184421e-01      0 1.0000000 1.0234687 0.009120117
2  3.728362e-02      1 0.5815579 0.5815579 0.008285569
3  3.012650e-02      2 0.5442743 0.5545939 0.008167752
4  2.521638e-02      4 0.4840213 0.4841877 0.007815372
5  5.825566e-03      6 0.4335885 0.4339214 0.007520252
6  1.830892e-03     12 0.3974700 0.3981358 0.007285318
7  1.747670e-03     13 0.3956391 0.3974700 0.007280737
8  1.553484e-03     15 0.3921438 0.3963049 0.007272700
9  1.539614e-03     20 0.3839880 0.3916445 0.007240308
10 1.081891e-03     26 0.3740013 0.3839880 0.007186231
11 9.986684e-04     31 0.3673435 0.3814913 0.007168361
12 9.154461e-04     34 0.3643475 0.3771638 0.007137110
13 8.877053e-04     36 0.3625166 0.3761651 0.007129847
14 8.322237e-04     39 0.3598535 0.3759987 0.007128635
15 7.490013e-04     40 0.3590213 0.3761651 0.007129847
 [ reached getOption("max.print") -- omitted 26 rows ]
plotcp(dtree1)

cp_optimal <- dtree1$cptable[which(dtree1$cptable[,"xerror"]==min(dtree1$cptable[,"xerror"])),"CP"]
cp_optimal
          16           24 
0.0005825566 0.0003120839 

Somehow, two optimal cp values came out (0.0005825566 and 0.0003120839), so I will use the slightly larger one, 0.0005825566, as our optimal cp.
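The tie occurs because two rows of the cptable share the minimum xerror. which.min() would force a single (first) value; another common choice is the one-standard-error rule, which picks the simplest tree whose xerror lies within one xstd of the minimum. A sketch of both, assuming dtree1 is the fitted rpart model:

```r
cpt <- dtree1$cptable

# Single cp at the minimum cross-validated error (first row on ties)
cp_min <- cpt[which.min(cpt[, "xerror"]), "CP"]

# One-standard-error rule: simplest tree within 1 SE of the best xerror
threshold <- min(cpt[, "xerror"]) + cpt[which.min(cpt[, "xerror"]), "xstd"]
cp_1se <- cpt[which(cpt[, "xerror"] <= threshold)[1], "CP"]
```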

dtree_optimal <- prune(dtree1, 0.0005825566) 
#Plot of prunned tree model with optimal cp
rpart.plot(dtree_optimal)

As you can see, the tree still uses many variables that are not actually significant. Therefore, we will build one more decision tree model with variables filtered according to the importance plot below.

fun_imp_ggplot_split(dtree_optimal)

tree_train_score = predict(dtree_optimal,newdata = bank_train_smote,type = "prob")[, 2]
tree_test_score = predict(dtree_optimal, newdata = bank_test,type = "prob")[, 2]

Choose appropriate Cut/Threshold

Trainset

measure_train = fun_gg_cutoff(tree_train_score, bank_train_smote$y,"acc", "f")
measure_train + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")

You can see that the best cut for the F-score is around 0.4 for the training set's predictions.

Testset

tree_train_cut = 0.4
measure_test = fun_gg_cutoff(tree_test_score, bank_test$y, "acc", "f")
measure_test + geom_vline(xintercept = c(tree_train_cut, 0.5), linetype = "dashed")

For the test set's predictions, the best threshold for the F-score is 0.6. Let's check the confusion matrix and metrics of this tree model.
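fun_gg_cutoff is a project-specific helper, but its core idea, sweeping candidate cutoffs and scoring each one, can be sketched in base R as follows (toy scores and labels, for illustration only):

```r
# Toy predicted scores and 0/1 labels loosely correlated with the scores
set.seed(1)
score <- runif(1000)
truth <- rbinom(1000, 1, score)

# F1 score of the classification obtained at a given cutoff
f1_at <- function(cut, score, truth) {
  pred <- as.integer(score >= cut)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  if (tp == 0) return(0)
  prec <- tp / (tp + fp)
  rec  <- tp / (tp + fn)
  2 * prec * rec / (prec + rec)
}

# Sweep a grid of cutoffs and keep the F1-maximizing one
cuts <- seq(0.05, 0.95, by = 0.01)
f1s  <- sapply(cuts, f1_at, score = score, truth = truth)
best_cut <- cuts[which.max(f1s)]
```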

Confusion Matrix & Model Evaluation Metrics

tree_test_class = fun_cut_predict(tree_test_score, 0.6)
tree_test_confm = confusionMatrix(tree_test_class, bank_test$y, positive = "1", mode = "everything")
tree_test_confm
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 10069   587
         1  1106   865
                                          
               Accuracy : 0.8659          
                 95% CI : (0.8599, 0.8718)
    No Information Rate : 0.885           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.4299          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.5957          
            Specificity : 0.9010          
         Pos Pred Value : 0.4389          
         Neg Pred Value : 0.9449          
              Precision : 0.4389          
                 Recall : 0.5957          
                     F1 : 0.5054          
             Prevalence : 0.1150          
         Detection Rate : 0.0685          
   Detection Prevalence : 0.1561          
      Balanced Accuracy : 0.7484          
                                          
       'Positive' Class : 1               
                                          

Compared to the last LR2 model, this tree model shows better F1 score and Recall (Sensitivity). Let's compare its performance with the other models in the next step, "Model Evaluation".
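As a sanity check, the reported Precision, Recall, and F1 can be reproduced directly from the confusion matrix counts above:

```r
# Counts from the confusion matrix above (Prediction x Reference)
tp <- 865    # predicted 1, actual 1
fp <- 1106   # predicted 1, actual 0
fn <- 587    # predicted 0, actual 1

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 4)
# precision = 0.4389, recall = 0.5957, f1 = 0.5054, matching the output above
```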

5-4) Decision Tree with selected variables (based on variable importance)

This time, I'll include only the top 10 variables in terms of the variable importance we saw above.

set.seed(1004)
dtree2 <- rpart(y~nr.employed+cons.conf.idx+cons.price.idx+contact+month+poutcome_success+campaign+day_of_week+poutcome_failure+education_university.degree, data = bank_train_smote, cp = 0.1^20) # tree model with filtered variables based on their importance
rpart.plot(dtree2)

# Find the optimal CP
dtree2$cptable
             CP nsplit rel error    xerror        xstd
1  4.184421e-01      0 1.0000000 1.0234687 0.009120117
2  3.728362e-02      1 0.5815579 0.5815579 0.008285569
3  3.012650e-02      2 0.5442743 0.5545939 0.008167752
4  2.521638e-02      4 0.4840213 0.4841877 0.007815372
5  5.825566e-03      6 0.4335885 0.4339214 0.007520252
6  1.747670e-03     12 0.3974700 0.3981358 0.007285318
7  1.553484e-03     14 0.3939747 0.3959720 0.007270399
8  1.539614e-03     19 0.3858189 0.3911451 0.007236815
9  1.331558e-03     25 0.3758322 0.3858189 0.007199261
10 1.081891e-03     27 0.3731691 0.3791611 0.007151578
11 9.986684e-04     32 0.3665113 0.3748336 0.007120134
12 7.490013e-04     40 0.3576897 0.3681758 0.007071057
13 6.102974e-04     44 0.3546937 0.3651798 0.007048690
14 5.825566e-04     50 0.3510320 0.3615180 0.007021111
15 4.993342e-04     54 0.3487017 0.3601864 0.007011016
 [ reached getOption("max.print") -- omitted 14 rows ]
plotcp(dtree2)

cp_optimal <- dtree2$cptable[which(dtree2$cptable[,"xerror"]==min(dtree2$cptable[,"xerror"])),"CP"]
cp_optimal
[1] 0.0004993342
dtree2_optimal <- prune(dtree2, cp_optimal) 
# Plot of pruned tree model with optimal cp
rpart.plot(dtree2_optimal)

tree2_train_score = predict(dtree2_optimal, newdata = bank_train_smote,type = "prob")[, 2]
tree2_test_score = predict(dtree2_optimal, newdata = bank_test,type = "prob")[, 2]

Choose appropriate Cut/Threshold

Trainset

measure_train2 = fun_gg_cutoff(tree2_train_score, bank_train_smote$y,"acc", "f")
measure_train2 + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")

You can see that the best cut for the F-score is around 0.4 for the training set's predictions.

Testset

tree2_train_cut = 0.4
measure_test2 = fun_gg_cutoff(tree2_test_score, bank_test$y, "acc", "f")
measure_test2 + geom_vline(xintercept = c(tree2_train_cut, 0.5), linetype = "dashed")

For the test set's predictions, the best threshold for the F-score is 0.7. Let's check the confusion matrix and metrics of this tree model.

Confusion Matrix & Model Evaluation Metrics

tree2_test_class = fun_cut_predict(tree2_test_score, 0.7)
tree2_test_confm = confusionMatrix(tree2_test_class, bank_test$y, positive = "1", mode = "everything")
tree2_test_confm
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 10314   666
         1   861   786
                                          
               Accuracy : 0.8791          
                 95% CI : (0.8733, 0.8847)
    No Information Rate : 0.885           
    P-Value [Acc > NIR] : 0.9819          
                                          
                  Kappa : 0.4386          
                                          
 Mcnemar's Test P-Value : 6.885e-07       
                                          
            Sensitivity : 0.54132         
            Specificity : 0.92295         
         Pos Pred Value : 0.47723         
         Neg Pred Value : 0.93934         
              Precision : 0.47723         
                 Recall : 0.54132         
                     F1 : 0.50726         
             Prevalence : 0.11499         
         Detection Rate : 0.06225         
   Detection Prevalence : 0.13043         
      Balanced Accuracy : 0.73214         
                                          
       'Positive' Class : 1               
                                          

Compared to the previous DT1 model with the full set of variables, this one shows better Accuracy, Specificity, Precision, and F1 score, but a lower Sensitivity (Recall). Overall, however, there is no severe gap between the two models.

5-5) K-nearest Neighbor

The last model is KNN, a non-parametric classification method.

knn <- train(y~., data=bank_train_smote, method ="knn", trControl = trainControl("cv", number = 5), preProcess = c("range"), tuneLength=10)
plot(knn)
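The preProcess = c("range") option tells caret to min-max scale each predictor to [0, 1], which matters for KNN because its distance computation is sensitive to the scales of the features. A minimal base-R sketch of the same rescaling (toy values for illustration):

```r
# Min-max scaling: map a numeric vector onto [0, 1]
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

age <- c(25, 40, 58, 33, 71)  # toy ages
rescale01(age)
# The smallest value maps to 0 and the largest to 1, so distances become
# comparable across predictors measured on very different scales.
```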

pred.knn <- predict(knn, newdata=bank_test)
### Confusion matrix
knn_test_confm = confusionMatrix(pred.knn, bank_test$y, positive = "1", mode = "everything")
knn_test_confm
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 7807  562
         1 3368  890
                                          
               Accuracy : 0.6888          
                 95% CI : (0.6806, 0.6968)
    No Information Rate : 0.885           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1693          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.61295         
            Specificity : 0.69861         
         Pos Pred Value : 0.20902         
         Neg Pred Value : 0.93285         
              Precision : 0.20902         
                 Recall : 0.61295         
                     F1 : 0.31173         
             Prevalence : 0.11499         
         Detection Rate : 0.07048         
   Detection Prevalence : 0.33721         
      Balanced Accuracy : 0.65578         
                                          
       'Positive' Class : 1               
                                          

Based on this result, let's compare all models together in the next Step.

6. Model Evaluation

ROC Curve & Precision-Recall AUC Curve

score_test = data.frame("Logistic Regression Full" = logistic_test_score,
                        "Logistic Regression Stepwise" = logistic2_test_score,
                        "Decision Tree Full" = tree_test_score,
                        "Decision Tree Var selected(imp)" = tree2_test_score,
                        "obs" = as.numeric(bank_test$y) - 1)


roc_test = score_test %>%
  gather(key = "Method", value = "score", -obs) %>% 
  ggplot() +
  aes(d = obs,
      m = score,
      color = Method) +
  geom_roc(labels = F, pointsize = 0, size = 0.6) +
  xlab("Specificity") +
  ylab("Sensitivity") +
  ggtitle("ROC Curve", subtitle = "Validation dataset")

prcurve_test = gg_prcurve(score_test) + ggtitle("PR Curve", subtitle = "Validation dataset")

curves_test = ggarrange(roc_test, prcurve_test, 
                        common.legend = T,
                        legend = "bottom")
print(curves_test)

I wanted to show the KNN model on the same plot, but I could not derive ROC/AUC values from the KNN model due to an internal error.

df_final = data.frame("Model" = c("Logistic Regression Full",
                                  "Logistic Regression Stepwise",
                                  "Decision Tree Full",
                                  "Decision Tree Var selected(imp)",
                                  "KNN"),
                      "AUROC" = c(auc(bank_test$y, logistic_test_score),
                                  auc(bank_test$y, logistic2_test_score),
                                  auc(bank_test$y, tree_test_score),
                                  auc(bank_test$y, tree2_test_score),
                                  NA),
                      "AUPR" = c(aucpr(bank_test$y, logistic_test_score),
                                 aucpr(bank_test$y, logistic2_test_score),
                                 aucpr(bank_test$y, tree_test_score),
                                 aucpr(bank_test$y, tree2_test_score), NA),
                      "Cut" = c(0.68, 0.67, 0.6, 0.7,NA),
                      "Accuracy" = c(logistic_test_confm[["overall"]][["Accuracy"]],
                                     logistic2_test_confm[["overall"]][["Accuracy"]],
                                     tree_test_confm[["overall"]][["Accuracy"]],
                                     tree2_test_confm[["overall"]][["Accuracy"]], 0.6896),
                      "F1" = c(logistic_test_confm[["byClass"]][["F1"]],
                               logistic2_test_confm[["byClass"]][["F1"]],
                               tree_test_confm[["byClass"]][["F1"]],
                               tree2_test_confm[["byClass"]][["F1"]], 0.31180),
                      stringsAsFactors = F)
df_final %>% 
  arrange(-F1)
                            Model     AUROC      AUPR  Cut  Accuracy        F1
1 Decision Tree Var selected(imp) 0.7916969 0.4324922 0.70 0.8790687 0.5072604
2              Decision Tree Full 0.7913294 0.4320732 0.60 0.8659222 0.5054046
3    Logistic Regression Stepwise 0.7762411 0.4381652 0.67 0.8730498 0.4728708
4        Logistic Regression Full 0.7759985 0.4369693 0.68 0.8759800 0.4709459
5                             KNN        NA        NA   NA 0.6896000 0.3118000

(*AUROC = area under the ROC curve, AUPR = area under the precision-recall curve)
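AUROC also has a convenient rank-based interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the Mann-Whitney statistic). A self-contained base-R sketch on toy data:

```r
# Rank-based AUROC: Mann-Whitney U divided by (n_pos * n_neg)
auroc <- function(score, truth) {
  r <- rank(score)  # midranks handle ties
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
auroc(score, truth)  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```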

Based on the above results, the Decision Tree model with variables selected by variable importance has the highest performance in terms of F1 score and Accuracy. As discussed at the beginning of the modelling section, our preferred evaluation metrics were Precision, Recall, and F1 score rather than Accuracy alone, because telemarketing needs to target the right clients as precisely as possible while not losing potential clients.
Therefore, we can say the best model is the Decision Tree model with selected variables.

7. Conclusion / Summary

Through the whole process of this bank telemarketing data analysis project, I came to appreciate the importance of 'targeting' and 'segmenting' in a marketing campaign. Especially through the EDA part, I gained many insights into which factors can make bank telemarketing successful.
There were several challenges I encountered during the project, as follows:
1) There was a quite large amount of "unknown" values, so I had to drop some of them. This was not too much of a sacrifice, since I checked the distribution of 'Deposit(y)' for each variable before dropping them. Still, fewer unknown values in the dataset would have allowed a better analysis.
2) Since our target variable 'y' was highly imbalanced (0:1 = 89%:11%), the original data itself was not well suited for analysis. To deal with this issue, I applied SMOTE (Synthetic Minority Oversampling Technique). However, this method still carries a risk of overfitting because it oversamples the minority class. Fortunately, I did not identify any severe overfitting in the modelling part, but SMOTE is not a perfect method for dealing with imbalanced data.
3) Looking back on the process, I realized that I could not derive good insights from, or make much use of, the social & economic variables (which were mostly numeric). It would have been better to also examine those variables in detail.